Abstract

Nowadays data sets are available in very complex and heterogeneous
ways. Mining of such data collections is essential to support many
real-world applications ranging from healthcare to marketing. In this work,
we focus on the analysis of “complex” sequential data by means of
interesting sequential patterns. We approach the problem using the elegant
mathematical framework of Formal Concept Analysis (FCA) and its
extension based on “pattern structures”. Pattern structures are used for
mining complex data (such as sequences or graphs) and are based on a
subsumption operation, which in our case is defined with respect to the
partial order on sequences. We show how pattern structures along with
projections (i.e., a data reduction of sequential structures) are able to
enumerate more meaningful patterns and increase the computing efficiency of
the approach. Finally, we show the applicability of the presented method
for discovering and analyzing interesting patient patterns from a French
healthcare data set on cancer. The quantitative and qualitative results
(with annotations and analysis from a physician) are reported in this use
case, which is the main motivation for this work.

Keywords: data mining; formal concept analysis; pattern structures; 
projections; sequences; sequential data. 


1 Introduction 

Sequence data is present and used in many applications. Mining sequential
patterns from sequence data has become an important data mining task. In the
last two decades, the main emphasis has been on developing efficient mining
algorithms and effective pattern representations [Han et al., 2000; Pei et al.,
2001a; Yan et al., 2003; Ding et al., 2009; Raïssi et al., 2008]. However, one
problem with traditional sequential pattern mining algorithms (and generally
with all pattern enumeration algorithms) is that they generate a large number
of frequent sequences while only a few of them are truly relevant. To tackle this
challenge, recent studies try to enumerate patterns using some alternative
interestingness measures or by sampling representative patterns. A general idea in
finding statistically significant patterns is to extract patterns whose
characteristics for a given measure, such as frequency, strongly deviate from their
expected value under a null model, i.e. the value expected by the distribution of
all data. In this work, we focus on complementing the statistical approaches with
a sound algebraic approach trying to answer the following question: can we develop
a framework for enumerating only relevant patterns based on data lattices and
their associated measures?

The above question can be answered by addressing the problem of analyzing
sequential data using the framework of Formal Concept Analysis (FCA),
a mathematical approach to data analysis [Ganter and Wille, 1999], and
pattern structures, an extension of FCA that handles complex data [Ganter and
Kuznetsov, 2001]. To analyze a dataset of “complex” sequences while avoiding
the classical efficiency bottlenecks, we introduce and explain the usage of
projections, which are mathematical mappings for defining approximations.
Projections for sequences allow one to reduce the computational costs and the volume
of enumerated patterns, avoiding the infamous “pattern flooding”. In addition,
we provide and discuss several measures, such as stability, to rank patterns with
respect to their “interestingness”, giving an expert an order in which the patterns
may be efficiently analyzed.

In this paper, we develop a novel, rigorous and efficient approach for
working with sequential pattern structures in formal concept analysis. The main
contributions of this work can be summarized as follows:


• Pattern structure specification and analysis. We propose a novel way of
dealing with sequences based on complex alphabets by mapping them to
pattern structures. The genericity provided by pattern structures
allows our approach to be directly instantiated with state-of-the-art
FCA algorithms, making the final implementation flexible, accurate and
scalable.

• “Projections” for sequential pattern structures. Projections significantly
decrease the number of patterns, while preserving the most interesting
ones for an expert. Projections are built to answer questions that an
expert may have. Moreover, the combination of projections and the concept
stability index provides an efficient tool for the analysis of complex sequential
datasets. The second advantage of projections is their ability to significantly
decrease the complexity of a problem, thus saving computational time.

• Experimental evaluations. We evaluate our approach on a real sequence
dataset of a regional healthcare system. The data set contains ordered
sets of hospitalizations for cancer patients with information about the
hospitals they visited, causes for the hospitalizations and medical
procedures. These ordered sets are considered as sequences. The experiments
reveal interesting (from a medical point of view) and useful patterns, and
show the feasibility and the efficiency of our approach.


Table 1: A toy FCA context.

        m1    m2    m3    m4
  g1    x                 x
  g2                x     x
  g3          x
  g4                x     x


This paper is an extension of the work presented at the CLA'14 conference
[Buzmakov et al., 2013]. The main differences w.r.t. the CLA'14 paper are a more
complete explanation of the mathematical framework and a new experimental
part evaluating different aspects of the introduced framework.

The paper is organized as follows. Section 2 introduces formal concept
analysis and pattern structures. The specification of pattern structures for the case of
sequences is presented in Section 3. Section 4 describes projections of sequential
pattern structures, followed in Section 5 by the evaluation and experiments.
Finally, related works are discussed before concluding the paper.


2 FCA and pattern structures 


2.1 Formal concept analysis 

FCA is a formalism that can be used for guiding data analysis and knowledge
discovery [Ganter and Wille, 1999]. FCA starts with a formal context and builds
a set of formal concepts organized within a concept lattice. A formal context is
a triple $(G, M, I)$, where $G$ is a set of objects, $M$ is a set of attributes and $I$ is
a relation between $G$ and $M$, $I \subseteq G \times M$. In Table 1 a cross table for a formal
context is shown. A Galois connection between $G$ and $M$ is defined as follows:

$A' = \{m \in M \mid \forall g \in A, (g, m) \in I\}, \quad A \subseteq G$

$B' = \{g \in G \mid \forall m \in B, (g, m) \in I\}, \quad B \subseteq M$

The Galois connection maps a set of objects to the maximal set of attributes
shared by all objects and reciprocally. For example, $\{g_1, g_2\}' = \{m_4\}$, while
$\{m_4\}' = \{g_1, g_2, g_4\}$, i.e. the set $\{g_1, g_2\}$ is not maximal. Given a set of objects
$A$, we say that $A'$ is the description of $A$.

Figure 1: Concept Lattice for the toy context


Definition 1. A formal concept is a pair $(A, B)$, where $A \subseteq G$ is a subset of
objects, $B \subseteq M$ is a subset of attributes, such that $A' = B$ and $A = B'$, where
$A$ is called the extent of the concept, and $B$ is called the intent of the concept.

A formal concept corresponds to a pair of maximal sets of objects and
attributes, i.e. it is not possible to add an object or an attribute to the concept
without violating the maximality property. For example, the pair
$(\{g_1, g_2, g_4\}, \{m_4\})$ is a formal concept. Formal concepts can be partially
ordered w.r.t. the extent inclusion (dually, intent inclusion). For example,
$(\{g_1\}, \{m_1, m_4\}) \leq (\{g_1, g_2, g_4\}, \{m_4\})$. This partial order of concepts
is shown in Figure 1. The number of formal concepts for a given context can be
exponential w.r.t. the cardinality of the set of objects or the set of attributes.
It is easy to see that for the context $(G, G, I_{\neq})$, where
$I_{\neq} = \{(x, y) \mid x \in G, y \in G, x \neq y\}$, the number of concepts
is equal to $2^{|G|}$.
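The derivation operators and the notion of a formal concept can be illustrated with a short Python sketch. It is only a brute-force illustration, not an algorithm from this paper: the dictionary `CONTEXT` is an assumption that encodes the toy context as it can be read off Table 1, and the function names are hypothetical.

```python
from itertools import combinations

# Toy context as read off Table 1 (an assumption of this sketch):
# each object is mapped to the set of attributes it has.
CONTEXT = {
    "g1": {"m1", "m4"},
    "g2": {"m3", "m4"},
    "g3": {"m2"},
    "g4": {"m3", "m4"},
}
ATTRIBUTES = {"m1", "m2", "m3", "m4"}


def intent_of(objects):
    """A': the attributes shared by every object in `objects`."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return frozenset(shared)


def extent_of(attributes):
    """B': the objects that have every attribute in `attributes`."""
    return frozenset(g for g, ms in CONTEXT.items() if set(attributes) <= ms)


def formal_concepts():
    """Brute force: close every subset of objects and collect (A'', A')."""
    concepts = set()
    objects = sorted(CONTEXT)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = intent_of(subset)
            concepts.add((extent_of(intent), intent))
    return concepts
```

Enumerating all subsets of objects is exponential, so this is only meant for toy contexts; it makes the closure step (A to A' to A'') explicit.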


2.2 Stability index of a concept 


The number of concepts in a lattice for real-world tasks can be large. To find
the most interesting subset of concepts, different measures can be used, such as
the stability of the concept [Kuznetsov, 2007] or the concept probability and
separation [Klimushkin et al., 2010]. These measures help to extract the most
interesting concepts. However, the latter two are less reliable in noisy data.


Definition 2. Given a concept $c$, the concept stability $\mathrm{Stab}(c)$ of $c$ is the relative
number of subsets of the concept extent (denoted $\mathrm{Ext}(c)$), whose description, i.e.
the result of $(\cdot)'$, is equal to the concept intent (denoted $\mathrm{Int}(c)$):

$$\mathrm{Stab}(c) = \frac{|\{s \in \wp(\mathrm{Ext}(c)) \mid s' = \mathrm{Int}(c)\}|}{|\wp(\mathrm{Ext}(c))|} \qquad (1)$$


Here $\wp(P)$ is the powerset of $P$. Stability measures how a concept depends on
objects in its extent. The larger the stability is, the more combinations of objects
can be deleted from the context without affecting the intent of the concept, i.e.
the intent of the most stable concepts is likely to be a characteristic pattern of a
given phenomenon and not an artifact of a dataset. Of course, stable concepts
still depend on the dataset, and, consequently, some important information can
be contained in the unstable concepts. However, the stability can be considered
as a good heuristic for selecting concepts because the more stable the concept
is, the less it depends on the given dataset w.r.t. object removal.

Table 2: A toy formal context

Figure 2: Concept Lattice for the context in Table 2 with corresponding stability
indexes.

Example 1. Figure 2 shows a lattice for the context in Table 2; for
simplicity some intents are not given. The extent of the outlined concept $c$ is
$\mathrm{Ext}(c) = \{g_1, g_2, g_3, g_4\}$, thus, its powerset contains $2^4$ elements.
Descriptions of 5 subsets of $\mathrm{Ext}(c)$ ($\{g_1\}, \ldots, \{g_4\}$ and $\emptyset$) are
different from $\mathrm{Int}(c) = \{m_6\}$, while all other subsets of $\mathrm{Ext}(c)$ have a
common description equal to $\{m_6\}$. So,
$\mathrm{Stab}(c) = \frac{2^4 - 5}{2^4} = 0.69$.
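Definition 2 can be evaluated directly on small contexts by enumerating the powerset of the extent. The sketch below is a brute-force illustration only (exponential in the size of the extent). Since the content of Table 2 is not reproduced here, it assumes the toy context read off Table 1 instead; `CONTEXT`, `description` and `stability` are hypothetical names.

```python
from itertools import chain, combinations

# Toy context as read off Table 1 (an assumption of this sketch).
CONTEXT = {
    "g1": {"m1", "m4"},
    "g2": {"m3", "m4"},
    "g3": {"m2"},
    "g4": {"m3", "m4"},
}
ALL_ATTRIBUTES = frozenset({"m1", "m2", "m3", "m4"})


def description(objects):
    """s': attributes common to all objects of s (all attributes if s is empty)."""
    shared = set(ALL_ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return frozenset(shared)


def powerset(items):
    """All subsets of `items`, as tuples."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )


def stability(extent, intent):
    """Fraction of subsets of the extent whose description equals the intent."""
    subsets = list(powerset(extent))
    hits = sum(1 for s in subsets if description(s) == frozenset(intent))
    return hits / len(subsets)
```

For the concept with extent {g1, g2, g4} and intent {m4}, only the subsets {g1, g2}, {g1, g4} and {g1, g2, g4} have the description {m4}, so the sketch returns 3/8.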


One of the fastest algorithms for processing a concept lattice $L$ is proposed in
[Roth et al., 2008], with a worst-case complexity of $O(|L|^2)$, where $|L|$ is
the size of the concept lattice. The experimental section shows that for a big
lattice, the stability computation can take much more time than the construction
of the concept lattice. Thus, the estimation of concept stability is an important
question. Here we present an efficient way for such an estimation. It should be
noticed that in a lattice the extent of any ancestor of a concept $c$ is a superset of
the extent of $c$, while the extent of any descendant is a subset. Given a concept
$c$ and an immediate descendant $d$, we have $\forall s \subseteq \mathrm{Ext}(d), s'' \subseteq \mathrm{Ext}(d)$, which
means that $s' \supseteq \mathrm{Int}(d) \supset \mathrm{Int}(c)$, i.e. $s' \neq \mathrm{Int}(c)$. Thus, in the
computation of the numerator of stability in (1), we can exclude all subsets of the
extent of a direct descendant of $c$. Thus, the following bound holds:

$$\mathrm{Stab}(c) \leq 1 - \max_{d \in DD(c)} \frac{1}{2^{\Delta(c,d)}}, \qquad (2)$$

where $DD(c)$ is the set of all direct descendants of $c$ and $\Delta(c, d)$ is the size of
the set-difference between the extent of $c$ and the extent of $d$,
$\Delta(c, d) = |\mathrm{Ext}(c) \setminus \mathrm{Ext}(d)|$.

Example 2. With the help of (2) we can find all stable concepts (and some
unstable ones), i.e. the concepts with a high stability w.r.t. a threshold $\theta$. If
$\theta = 0.97$, we should compute for each concept $c$ in the lattice the value
$md(c) = \min_{d \in DD(c)} \Delta(c, d)$ and then select the concepts verifying
$md(c) > -\log_2(1 - 0.97) = 5.06$.
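The bound (2) and the threshold test of Example 2 can be written down in a few lines. This is a minimal sketch under the assumption that the extents of the direct descendants of a concept are already known; the function names are hypothetical, and the logarithm is taken in base 2, which is what the bound implies.

```python
import math


def stability_upper_bound(extent, descendant_extents):
    """Right-hand side of bound (2): 1 - max over d of 2^(-Delta(c, d))."""
    deltas = [len(set(extent) - set(d)) for d in descendant_extents]
    if not deltas:
        return 1.0  # no direct descendant: the bound gives no information
    return 1.0 - max(2.0 ** -delta for delta in deltas)


def min_delta_threshold(theta):
    """Value that md(c) must exceed for the bound to allow Stab(c) >= theta."""
    return -math.log2(1.0 - theta)


def may_be_stable(extent, descendant_extents, theta):
    """Cheap filter: False means the concept cannot reach stability theta."""
    return stability_upper_bound(extent, descendant_extents) >= theta
```

Because the bound is only an upper bound, concepts that pass the filter still have to be checked with the exact stability; concepts that fail it can be discarded without further computation.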


2.3 Pattern structures 


Although FCA applies to binary contexts, more complex data such as sequences
or graphs can be directly processed as well. For that, pattern structures were
introduced in [Ganter and Kuznetsov, 2001].


Definition 3. A pattern structure is a triple $(G, (D, \sqcap), \delta)$, where $G$ is a set
of objects, $(D, \sqcap)$ is a complete meet-semilattice of descriptions and
$\delta : G \to D$ maps an object to a description.


The lattice operation in the semilattice ($\sqcap$) corresponds to the similarity
between two descriptions. Standard FCA can be presented in terms of a
pattern structure. In this case, $G$ is the set of objects, the semilattice of
descriptions is $(\wp(M), \sqcap)$ and a description is a set of attributes, with the $\sqcap$
operation corresponding to the set intersection ($\wp(M)$ denotes the powerset of $M$).
If $x = \{a, b, c\}$ and $y = \{a, c, d\}$ then $x \sqcap y = x \cap y = \{a, c\}$. The mapping
$\delta : G \to \wp(M)$ is given by $\delta(g) = \{m \in M \mid (g, m) \in I\}$, and returns the
description for a given object as a set of attributes.

The Galois connection for a pattern structure $(G, (D, \sqcap), \delta)$ is defined as
follows:

$A^{\diamond} := \bigsqcap_{g \in A} \delta(g), \quad \text{for } A \subseteq G$

$d^{\diamond} := \{g \in G \mid d \sqsubseteq \delta(g)\}, \quad \text{for } d \in D$

The Galois connection makes a correspondence between sets of objects and
descriptions. Given a subset of objects $A$, $A^{\diamond}$ returns the description which
is common to all objects in $A$. Given a description $d$, $d^{\diamond}$ is the set of all
objects whose description subsumes $d$. More precisely, the partial order (or
the subsumption order) on $D$ ($\sqsubseteq$) is defined w.r.t. the similarity operation $\sqcap$:
$c \sqsubseteq d \Leftrightarrow c \sqcap d = c$, and $c$ is subsumed by $d$.

Definition 4. A pattern concept of a pattern structure $(G, (D, \sqcap), \delta)$ is a pair
$(A, d)$ where $A \subseteq G$ and $d \in D$ such that $A^{\diamond} = d$ and $d^{\diamond} = A$; $A$ is called the
concept extent and $d$ is called the concept intent.
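The two derivation operators of a pattern structure only need the similarity operation, since subsumption is derived from it. The following sketch illustrates this for non-empty sets of objects; it is not an implementation from this paper, and the names `common_description`, `matching_objects` and `pattern_concept` are hypothetical. Standard FCA is recovered by using set intersection as the similarity operation.

```python
from functools import reduce


def subsumed(c, d, meet):
    """c is subsumed by d iff meet(c, d) == c."""
    return meet(c, d) == c


def common_description(objects, delta, meet):
    """First operator: meet of delta(g) over a non-empty set of objects."""
    return reduce(meet, (delta[g] for g in objects))


def matching_objects(d, delta, meet):
    """Second operator: all objects whose description subsumes d."""
    return frozenset(g for g, desc in delta.items() if subsumed(d, desc, meet))


def pattern_concept(objects, delta, meet):
    """Close a non-empty set of objects into a pattern concept (extent, intent)."""
    intent = common_description(objects, delta, meet)
    return matching_objects(intent, delta, meet), intent


def set_meet(x, y):
    """Similarity for standard FCA: set intersection."""
    return x & y


# Standard FCA seen as a pattern structure; the context is the one read
# off Table 1 (an assumption of this sketch).
DELTA = {
    "g1": frozenset({"m1", "m4"}),
    "g2": frozenset({"m3", "m4"}),
    "g3": frozenset({"m2"}),
    "g4": frozenset({"m3", "m4"}),
}
```

Any other description type (sequences, graphs, intervals) can be plugged in by passing a different `meet` function, which is the genericity that pattern structures provide.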
Table 3: Toy sequential data on patient medical trajectories.

  Patient   Trajectory
  $p^1$     $\langle [H_1, \{a\}]; [H_1, \{c, d\}]; [H_1, \{a, b\}]; [H_1, \{d\}] \rangle$
  $p^2$     $\langle [H_2, \{c, d\}]; [H_3, \{b, d\}]; [H_3, \{a, d\}] \rangle$
  $p^3$     $\langle [H_4, \{c, d\}]; [H_4, \{b\}]; [H_4, \{a\}]; [H_4, \{a, d\}] \rangle$


As in standard FCA, a pattern concept corresponds to the maximal set of
objects $A$ whose description subsumes the description $d$, where $d$ is the maximal
common description for objects in $A$. The set of all concepts can be partially
ordered w.r.t. the partial order on extents (dually, on intent patterns, i.e. $\sqsubseteq$),
within a concept lattice.

An example of pattern structures is given in Table 3, while the corresponding
lattice is depicted in Figure 3.

As stability of concepts only depends on extents, it can be defined by the
same procedure for both formal contexts and pattern structures.

3 Sequential pattern structures 

Certain phenomena, such as a patient trajectory (clinical history), can be
considered as a sequence of events. This section describes how FCA and pattern
structures can process sequential data.

3.1 An example of sequential data 

Imagine that we have medical trajectories of patients, i.e. sequences of
hospitalizations, where every hospitalization is described by a hospital name and a
set of procedures. An example of sequential data on medical trajectories with
three patients is given in Table 3. We have a set of procedures $P = \{a, b, c, d\}$, a
set of hospital names $T_H = \{H_1, H_2, H_3, H_4, CL, CH, *\}$, where hospital names
are hierarchically organized (by level of generality). $H_1$ and $H_2$ are central
hospitals ($CH$), $H_3$ and $H_4$ are clinics ($CL$), and $*$ denotes the root of this
hierarchy. The least common ancestor in this hierarchy is denoted by $h_1 \sqcap h_2$,
for any $h_1, h_2 \in T_H$, i.e. $H_1 \sqcap H_2 = CH$. Every hospitalization is described
by one hospital name and may contain several procedures. The procedure
order in each hospitalization is not important in our case. For example, the first
hospitalization $[H_2, \{c, d\}]$ for the second patient ($p^2$) was a stay in hospital $H_2$
and during this hospitalization the patient underwent procedures $c$ and $d$. An
important task is to find the “characteristic” sequences of procedures and
associated hospitals in order to improve hospitalization planning, optimize clinical
processes or detect anomalies.

We approach the search for characteristic sequences by finding the most
stable concepts in the lattice corresponding to a sequential pattern structure. For
the simplification of calculations, subsequences are considered without “gaps”,
i.e. the order of non-consecutive elements is not taken into account. This is
reasonable in this task because experts are interested in regular consecutive events
in healthcare trajectories. A sequential pattern structure is a set of sequences
and is based on the set of maximal common subsequences (without gaps)
between two sequences. The next subsections define the partial order on sequences
and the corresponding pattern structures.


3.2 Partial order on complex sequences 

A sequence is constituted of elements from an alphabet. The classical
subsequence matching task requires no special properties of the alphabet. Several
generalizations of the classical case were made by introducing a subsequence
relation based on an itemset alphabet [Agrawal and Srikant, 1995] or on a
multidimensional and multilevel alphabet [Plantevit et al., 2010]. Here, we
generalize the previous cases, requiring for an alphabet to form a semilattice
$(E, \sqcap_E)$. (We should note that in this paper we consider two semilattices: the
first one is related to the characters of the alphabet, $(E, \sqcap_E)$, and the second
one is related to pattern structures, $(D, \sqcap)$.) Thanks to the formalism of pattern
structures we are able to process in a unified way all types of sequential datasets
with a poset-shaped alphabet (it is mentioned above that any partial order can
be transformed into a semilattice). However, some sequential data can have
connections between elements, e.g. [Adda et al., 2010], and, thus, cannot be
straightforwardly processed by our approach.


Definition 5. Given a semilattice $(E, \sqcap_E)$, also called an alphabet, a sequence
is an ordered list of elements from $E$. We denote it by $\langle e_1; e_2; \cdots; e_n \rangle$ where
$e_i \in E$.


In this alphabet semilattice $(E, \sqcap_E)$ there is a bottom element $\bot_E$ that
can be matched with any other element. Formally, $\forall e \in E, \bot_E = \bot_E \sqcap_E e$.
This element is required by the lattice structure, but provides no useful
information. Thus, it should be excluded from sequences. The bottom element
of $E$ corresponds to the empty set in sequential mining [Agrawal and Srikant,
1995], and the empty set is always ignored in this domain.


Definition 6. A valid sequence $\langle e_1; \cdots; e_n \rangle$ is a sequence where $e_i \neq \bot_E$ for
all $i \in \{1, \ldots, n\}$.


Definition 7. Given an alphabet $(E, \sqcap_E)$ and two sequences $t = \langle t_1; \ldots; t_k \rangle$ and
$s = \langle s_1; \ldots; s_n \rangle$ based on $E$ ($t_q, s_p \in E$), the sequence $t$ is a subsequence of $s$,
denoted $t \leq s$, iff $k \leq n$ and there exist $j_1, \ldots, j_k$ such that
$1 \leq j_1 < j_2 < \cdots < j_k \leq n$ and for all $i \in \{1, 2, \ldots, k\}$,
$t_i \sqsubseteq_E s_{j_i}$, i.e. $t_i \sqcap_E s_{j_i} = t_i$.

Example 3. In the running example (Section 3.1), the alphabet is
$E = T_H \times \wp(P)$ with the similarity operation
$(h_1, P_1) \sqcap (h_2, P_2) = (h_1 \sqcap h_2, P_1 \cap P_2)$,
where $h_1, h_2 \in T_H$ are hospitals and $P_1, P_2 \in \wp(P)$ are sets of procedures.
Thus, the sequence $ss^1 = \langle [CH, \{c, d\}]; [H_1, \{b\}]; [*, \{d\}] \rangle$ is a subsequence of
$p^1 = \langle [H_1, \{a\}]; [H_1, \{c, d\}]; [H_1, \{a, b\}]; [H_1, \{d\}] \rangle$ because if we set
$j_i = i + 1$ (Definition 7) then $ss^1_1 \sqsubseteq p^1_2$ (‘$CH$’ is more general than $H_1$ and
$\{c, d\} \subseteq \{c, d\}$), $ss^1_2 \sqsubseteq p^1_3$ (the same hospital and $\{b\} \subseteq \{b, a\}$) and
$ss^1_3 \sqsubseteq p^1_4$ (‘$*$’ is more general than $H_1$ and $\{d\} \subseteq \{d\}$).
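Example 3 can be replayed with a small sketch of the alphabet semilattice and of the subsequence test of Definition 7. The encoding is an assumption of this sketch: an element is a pair (hospital, frozenset of procedures), the taxonomy is stored in a hypothetical `PARENT` table built from Section 3.1, and the subsequence test uses a greedy left-to-right matching, which is sufficient here because each element is matched independently of the others.

```python
# Hospital taxonomy of Section 3.1: parent of every node, '*' is the root.
PARENT = {"H1": "CH", "H2": "CH", "H3": "CL", "H4": "CL", "CH": "*", "CL": "*"}


def ancestors(hospital):
    """Chain from a node up to the root '*', the node itself included."""
    chain = [hospital]
    while chain[-1] != "*":
        chain.append(PARENT[chain[-1]])
    return chain


def hospital_meet(h1, h2):
    """Least common ancestor of two nodes of the taxonomy."""
    above_h2 = set(ancestors(h2))
    return next(h for h in ancestors(h1) if h in above_h2)


def element_meet(e1, e2):
    """Similarity of two hospitalizations: (LCA of hospitals, common procedures)."""
    (h1, p1), (h2, p2) = e1, e2
    return (hospital_meet(h1, h2), frozenset(p1) & frozenset(p2))


def element_leq(t, s):
    """t is subsumed by s iff the meet of t and s equals t."""
    t = (t[0], frozenset(t[1]))
    return element_meet(t, s) == t


def is_subsequence(t, s):
    """Definition 7 (gaps allowed), by greedy left-to-right matching."""
    j = 0
    for element in t:
        while j < len(s) and not element_leq(element, s[j]):
            j += 1
        if j == len(s):
            return False
        j += 1
    return True


p1 = [("H1", {"a"}), ("H1", {"c", "d"}), ("H1", {"a", "b"}), ("H1", {"d"})]
ss1 = [("CH", {"c", "d"}), ("H1", {"b"}), ("*", {"d"})]
```

With these definitions, `is_subsequence(ss1, p1)` holds, using the same positions 2, 3 and 4 of the trajectory as in the example.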

With complex sequences and this kind of subsequence relation the
computation can be hard. Thus, for the sake of simplification, only “contiguous”
subsequences are considered, where only the order of consecutive elements is taken
into account, i.e. given $j_i$ in Definition 7, $j_i = j_{i-1} + 1$ for all
$i \in \{2, 3, \ldots, k\}$. Since experts are interested in regular consecutive events in
healthcare trajectories, such a restriction makes sense for our data. It helps to
connect only related hospitalizations.

The next section introduces pattern structures that are based on complex
sequences with a general subsequence relation, while the experiments are provided
for a “contiguous” subsequence relation.


3.3 Sequential meet-semilattice 


Based on the previous definitions, we can define the sequential pattern
structure used for representing and managing sequences. For that, we make an
analogy with the pattern structures for graphs [Kuznetsov, 1999] where the
meet-semilattice operation $\sqcap$ respects subgraph isomorphism. Thus, we
introduce a sequential meet-semilattice respecting the subsequence relation. Given an
alphabet lattice $(E, \sqcap_E)$, $\mathfrak{S}$ is the set of all valid sequences based on
$(E, \sqcap_E)$. $\mathfrak{S}$ is partially ordered w.r.t. Definition 7. $(D, \sqcap)$ is a semilattice
on $\mathfrak{S}$, where $D \subseteq \wp(\mathfrak{S})$ such that, if $d \in D$ contains a sequence $s$, then
all subsequences of $s$ should be included into $d$,
$\forall s \in d, \nexists \tilde{s} \leq s : \tilde{s} \notin d$, and the similarity operation
is the set intersection for two sets of sequences. Given two patterns $d_1, d_2 \in D$,
the set intersection operation ensures that if a sequence $s$ belongs to $d_1 \sqcap d_2$
then any subsequence of $s$ belongs to $d_1 \sqcap d_2$ and thus $d_1 \sqcap d_2 \in D$. As the set
intersection operation is idempotent, commutative and associative, $(D, \sqcap)$ is a
semilattice.


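Example 4. If pattern $d_1 \in D$ includes the sequence $\langle [*, \{c, d\}]; [*, \{b\}] \rangle$ (see
Table 4), then it should also include $\langle [*, \{d\}]; [*, \{b\}] \rangle$, $\langle [*, \{c, d\}] \rangle$,
$\langle [*, \{d\}] \rangle$ and others. If pattern $d_2 \in D$ includes the sequence
$\langle [*, \{a\}]; [*, \{d\}] \rangle$, then it should include $\langle [*, \{a\}] \rangle$, $\langle [*, \{d\}] \rangle$ and
$\langle \rangle$. Thus the intersection of the two sets $d_1$ and $d_2$
is equal to the set $\{\langle [*, \{d\}] \rangle, \langle \rangle\}$.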

The next proposition stems from the aforementioned and will be used in the
proofs in the next section.

Proposition 1. Given $(G, (D, \sqcap), \delta)$ and $x, y \in D$, $x \sqsubseteq y$ if and only if
$\forall s^x \in x$ there is a sequence $s^y \in y$, such that $s^x \leq s^y$.

The set of all possible subsequences for a given sequence can be large. Thus,
it is more efficient to consider a pattern $d \in D$ as a set of only maximal sequences
$\tilde{d}$, $\tilde{d} = \{s \in d \mid \nexists s^* \in d : s^* > s\}$. Furthermore, every pattern will be
given only by the set of all maximal sequences. For example,
$\{p^2\} \sqcap \{p^3\} = \{ss^6, ss^7, ss^8\}$ (see Tables 3 and 4), i.e.
$\{ss^6, ss^7, ss^8\}$ is the set of all maximal sequences specifying the intersection
of $p^2$ and $p^3$. Similarly we have $\{ss^6, ss^7, ss^8\} \sqcap \{p^1\} =$ . Note that
representing a pattern by the set of all maximal sequences allows for an efficient
implementation of the intersection of two patterns (in Section 5.1 we give more
details on the similarity operation w.r.t. a contiguous subsequence relation).

Figure 3: The concept lattice for the pattern structure given by Table 3.
Concept intents reference to sequences in Tables 3 and 4.

Table 4: Subsequences of patient sequences in Table 3.

             Subsequences
  $ss^1$     $\langle [CH, \{c, d\}]; [H_1, \{b\}]; [*, \{d\}] \rangle$
  $ss^2$     $\langle [CH, \{c, d\}]; [*, \{b\}]; [*, \{d\}] \rangle$
  $ss^3$     $\langle [CH, \{\}]; [*, \{d\}]; [*, \{a\}] \rangle$
  $ss^4$     $\langle [*, \{c, d\}]; [*, \{b\}] \rangle$
  $ss^5$     $\langle [*, \{a\}] \rangle$
  $ss^6$     $\langle [*, \{c, d\}]; [CL, \{b\}]; [CL, \{a\}] \rangle$
  $ss^7$     $\langle [CL, \{d\}]; [CL, \{\}] \rangle$
  $ss^8$     $\langle [CL, \{\}]; [CL, \{a, d\}] \rangle$
  $ss^9$     $\langle [CH, \{c, d\}] \rangle$
  $ss^{10}$  $\langle [CL, \{b\}]; [CL, \{a\}] \rangle$
  $ss^{11}$  $\langle [*, \{c, d\}]; [*, \{b\}] \rangle$
  $ss^{12}$  $\langle [*, \{a\}]; [*, \{d\}] \rangle$
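For the contiguous subsequence relation, the intersection of two patterns given by their maximal sequences can be sketched as follows: slide one sequence along the other, take element-wise meets, cut the result at the bottom element, and keep only the maximal runs. This is a hedged illustration, not the implementation detailed in Section 5.1; it assumes the bottom element is the pair ('*', empty set), encodes procedures as strings of one-letter names, and all function names are hypothetical.

```python
PARENT = {"H1": "CH", "H2": "CH", "H3": "CL", "H4": "CL", "CH": "*", "CL": "*"}
BOTTOM = ("*", frozenset())  # assumed bottom element of the alphabet


def el(hospital, procedures):
    """Alphabet element; procedures are given as a string of 1-letter names."""
    return (hospital, frozenset(procedures))


def ancestors(h):
    chain = [h]
    while chain[-1] != "*":
        chain.append(PARENT[chain[-1]])
    return chain


def element_meet(e1, e2):
    above = set(ancestors(e2[0]))
    lca = next(h for h in ancestors(e1[0]) if h in above)
    return (lca, e1[1] & e2[1])


def element_leq(t, s):
    return element_meet(t, s) == t


def is_contiguous_subsequence(t, s):
    k = len(t)
    return any(all(element_leq(t[i], s[start + i]) for i in range(k))
               for start in range(len(s) - k + 1))


def common_runs(s1, s2):
    """Bottom-free runs of element-wise meets over every alignment of s1, s2."""
    runs = set()
    for offset in range(-(len(s1) - 1), len(s2)):
        run = []
        for i in range(len(s1)):
            j = i + offset
            m = element_meet(s1[i], s2[j]) if 0 <= j < len(s2) else BOTTOM
            if m == BOTTOM:
                if run:
                    runs.add(tuple(run))
                run = []
            else:
                run.append(m)
        if run:
            runs.add(tuple(run))
    return runs


def pattern_meet(d1, d2):
    """Maximal common contiguous subsequences of two patterns."""
    runs = set()
    for s1 in d1:
        for s2 in d2:
            runs |= common_runs(s1, s2)
    return {s for s in runs
            if not any(s != t and is_contiguous_subsequence(s, t) for t in runs)}


p2 = (el("H2", "cd"), el("H3", "bd"), el("H3", "ad"))
p3 = (el("H4", "cd"), el("H4", "b"), el("H4", "a"), el("H4", "ad"))
ss6 = (el("*", "cd"), el("CL", "b"), el("CL", "a"))
ss7 = (el("CL", "d"), el("CL", ""))
ss8 = (el("CL", ""), el("CL", "ad"))
```

On the trajectories of the second and third patients, `pattern_meet({p2}, {p3})` yields exactly the three sequences ss6, ss7 and ss8 of Table 4.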


Example 5. The sequential pattern structure for our example (Subsection 3.1)
is $(G, (D, \sqcap), \delta)$, where $G = \{p^1, p^2, p^3\}$, $(D, \sqcap)$ is the semilattice of sequential
descriptions, and $\delta$ is the mapping associating an object in $G$ to a description
in $D$ shown in Table 3. Figure 3 shows the resulting lattice of sequential pattern
concepts for this particular pattern structure $(G, (D, \sqcap), \delta)$.

4 Projections of sequential pattern structures


Pattern structures are hard to process due to the large number of concepts in the
concept lattice, the complexity of the involved descriptions and the similarity
operation. Moreover, a given pattern structure can produce a lattice with a lot
of patterns which are not interesting for an expert. Can we save computational
time by avoiding the computation of “useless” patterns? Projections of pattern
structures “simplify” to some degree the computation and allow one to work with a
reduced description. In fact, projections can be considered as filters on patterns
respecting mathematical properties. These properties ensure that the
projection of a semilattice is a semilattice and that projected concepts are related to
original ones [Ganter and Kuznetsov, 2001]. Moreover, the stability measure of
projected concepts never decreases w.r.t. the original concepts. We introduce
projections on sequential patterns revising [Ganter and Kuznetsov, 2001]. It is
necessary to provide an extended definition of projection in order to deal with
interesting projections for real-world sequential datasets.


Definition 8 ([Ganter and Kuznetsov, 2001]). A projection $\psi : D \to D$ is an
interior operator, i.e. it is (1) monotone ($x \sqsubseteq y \Rightarrow \psi(x) \sqsubseteq \psi(y)$), (2) contractive
($\psi(x) \sqsubseteq x$) and (3) idempotent ($\psi(\psi(x)) = \psi(x)$).


Definition 9. A projected pattern structure $\psi((G, (D, \sqcap), \delta))$ is a pattern
structure $(G, (D_\psi, \sqcap_\psi), \psi \circ \delta)$, where
$D_\psi = \psi(D) = \{d \in D \mid \exists d^* \in D : \psi(d^*) = d\}$
and $\forall x, y \in D, x \sqcap_\psi y := \psi(x \sqcap y)$.


Note that in [Ganter and Kuznetsov, 2001]
$\psi((G, (D, \sqcap), \delta)) = (G, (D, \sqcap), \psi \circ \delta)$. Our definition allows one to use a
wider set of projections. In fact all projections that we describe for sequential
pattern structures below require Definition 9. Now we should show that
$(D_\psi, \sqcap_\psi)$ is a semilattice.


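Proposition 2. Given a semilattice $(D, \sqcap)$ and a projection $\psi$, for all $x, y \in D$
$\psi(x \sqcap y) = \psi(\psi(x) \sqcap y)$.


Proof. 1. $\psi(x) \sqsubseteq x$, thus, $x, y \sqsupseteq (x \sqcap y) \sqsupseteq (\psi(x) \sqcap y) \sqsupseteq \psi(\psi(x) \sqcap y)$.

2. $x \sqsubseteq y \Rightarrow \psi(x) \sqsubseteq \psi(y)$, thus, $\psi(x \sqcap y) \sqsupseteq \psi(\psi(x) \sqcap y)$.

3. $\psi(x \sqcap y) \sqcap \psi(x) \sqcap y = \psi(x \sqcap y) \sqcap y = \psi(x \sqcap y)$,
then $(\psi(x) \sqcap y) \sqsupseteq \psi(x \sqcap y)$ and
$\psi(\psi(x) \sqcap y) \sqsupseteq \psi(\psi(x \sqcap y)) = \psi(x \sqcap y)$.

4. From (2) and (3) it follows that $\psi(x \sqcap y) = \psi(\psi(x) \sqcap y)$.

□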
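The three properties of Definition 8 and the identity of Proposition 2 can be checked exhaustively on a small semilattice. The sketch below uses the powerset of {a, b, c} with set intersection as a stand-in for $(D, \sqcap)$ and two hypothetical projections chosen for illustration: `restrict` (drop attribute c) and `keep_large` (keep a description only if it has at least two attributes). Neither comes from this paper.

```python
from itertools import chain, combinations


def powerset(items):
    items = list(items)
    return chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )


D = [frozenset(s) for s in powerset("abc")]


def meet(x, y):
    return x & y


def leq(x, y):
    return meet(x, y) == x


def is_projection(psi, elements):
    """Check that psi is monotone, contractive and idempotent on `elements`."""
    monotone = all(leq(psi(x), psi(y))
                   for x in elements for y in elements if leq(x, y))
    contractive = all(leq(psi(x), x) for x in elements)
    idempotent = all(psi(psi(x)) == psi(x) for x in elements)
    return monotone and contractive and idempotent


def satisfies_proposition_2(psi, elements):
    """Check psi(x meet y) == psi(psi(x) meet y) for all pairs."""
    return all(psi(meet(x, y)) == psi(meet(psi(x), y))
               for x in elements for y in elements)


def projected_meet(psi, x, y):
    """Similarity operation of the projected pattern structure (Definition 9)."""
    return psi(meet(x, y))


def restrict(x):
    return x & frozenset("ab")


def keep_large(x):
    return x if len(x) >= 2 else frozenset()
```

The second projection shows why the projected similarity operation is needed: {a, b} and {b, c} both belong to the image of `keep_large`, but their plain intersection {b} does not, whereas `projected_meet` returns the empty description, which does.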


Corollary 1. $X_1 \sqcap_\psi X_2 \sqcap_\psi \cdots \sqcap_\psi X_N = \psi(X_1 \sqcap X_2 \sqcap \cdots \sqcap X_N)$

Proof. It can be proven by induction.

1. $X_1 \sqcap_\psi X_2 = \psi(X_1 \sqcap X_2)$ by Definition 9.

2. If $X_1 \sqcap_\psi \cdots \sqcap_\psi X_K = \psi(X_1 \sqcap \cdots \sqcap X_K)$, then

$X_1 \sqcap_\psi \cdots \sqcap_\psi X_K \sqcap_\psi X_{K+1} = \psi(X_1 \sqcap \cdots \sqcap X_K) \sqcap_\psi X_{K+1} =$

$= \psi(\psi(X_1 \sqcap \cdots \sqcap X_K) \sqcap X_{K+1}) = \psi(X_1 \sqcap \cdots \sqcap X_{K+1})$,

where the last equality follows from Proposition 2.

□


Corollary 2. Given a semilattice $(D, \sqcap)$ and a projection $\psi$, $(D_\psi, \sqcap_\psi)$ is a
semilattice, i.e. $\sqcap_\psi$ is commutative, associative and idempotent.


The concepts of a pattern structure and a projected pattern structure are
connected through Proposition 3. This proposition can be found in [Ganter and
Kuznetsov, 2001], but thanks to Corollary 2 it is valid in our case.


Proposition 3. Given a concept $(A, d_\psi)$ in $\psi((G, (D, \sqcap), \delta))$, the extent $A$ is an
extent in $(G, (D, \sqcap), \delta)$. Given a concept $(A, d_\psi)$ in $\psi((G, (D, \sqcap), \delta))$, the intent
$d_\psi$ is of the form $d_\psi = \psi(d)$, where $(A, d)$ is a concept in $(G, (D, \sqcap), \delta)$.


Moreover, while preserving the extents of some concepts, projections cannot
decrease the stability of the projected concepts, i.e. if the projection preserves
a stable concept, then its stability (Definition 2) can only increase.

Proposition 4. Given a pattern structure $(G, (D, \sqcap), \delta)$, its concept $c$ and a
projected pattern structure $(G, (D_\psi, \sqcap_\psi), \psi \circ \delta)$, and the projected concept
$\tilde{c}$, if the concept extents are equal ($\mathrm{Ext}(c) = \mathrm{Ext}(\tilde{c})$) then
$\mathrm{Stab}(c) \leq \mathrm{Stab}(\tilde{c})$.

Proof. Concepts $c$ and $\tilde{c}$ have the same extent. Thus, according to Definition 2,
in order to prove the proposition, it is enough to prove that for any subset
$A \subseteq \mathrm{Ext}(c)$, if $A^{\diamond} = \mathrm{Int}(c)$ in the original pattern structure, then
$A^{\diamond} = \mathrm{Int}(\tilde{c})$ in the projected one.

Suppose that $\exists A \subset \mathrm{Ext}(c)$ such that $A^{\diamond} = \mathrm{Int}(c)$ in the original pattern
structure and $A^{\diamond} \neq \mathrm{Int}(\tilde{c})$ in the projected one. Then there is a descendant
concept $\tilde{d}$ of $\tilde{c}$ in the projected pattern structure such that
$A^{\diamond} = \mathrm{Int}(\tilde{d})$ in the projected lattice. Then there is an original concept $d$
for the projected concept $\tilde{d}$ with the same extent $\mathrm{Ext}(\tilde{d})$. Then
$A^{\diamond} \sqsupseteq \mathrm{Int}(d) \sqsupset \mathrm{Int}(c)$ and, so, $A^{\diamond}$ cannot
be equal to $\mathrm{Int}(c)$ in the original lattice. Contradiction. □


Now we are going to present two projections of sequential pattern structures.
The first projection comes from the following observation. In many cases it may
be more interesting to analyze quite long subsequences rather than short ones.
This kind of projection is called Minimal Length Projection (MLP) and it
depends on the minimal length parameter $\ell$ for the sequences in a pattern. The
corresponding function maps a pattern without short sequences to itself, and
a pattern with short sequences to the pattern containing only long sequences
w.r.t. a given length threshold. Later, two propositions state that MLP is
coherent with Definition 8.


Definition 10. The function $\psi_{MLP} : D \to D$ of minimal length $\ell$ is defined as

$\psi_{MLP}(d) = \{s \in d \mid \mathrm{length}(s) \geq \ell\}$

Example 6. If we prefer common subsequences of length $\ell \geq 3$, then between
$p^2$ and $p^3$ in Table 3 there is only one maximal common subsequence, $ss^6$ in
Table 4, while $ss^7$ and $ss^8$ are too short to be considered. Figure 4a shows the
lattice of the projected pattern structure with patterns of length greater
or equal to 3.
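When a pattern is stored by its maximal sequences only, the minimal length projection reduces to a filter on sequence length. The sketch below assumes that sequences are tuples of (hospital, frozenset of procedures) pairs, with procedures written as strings of one-letter names; the helper names are hypothetical.

```python
def el(hospital, procedures):
    """Alphabet element; procedures are given as a string of 1-letter names."""
    return (hospital, frozenset(procedures))


def minimal_length_projection(pattern, min_length):
    """Keep only the sequences of the pattern with at least min_length elements."""
    return {s for s in pattern if len(s) >= min_length}


# Maximal common subsequences of the second and third patients (Table 4).
ss6 = (el("*", "cd"), el("CL", "b"), el("CL", "a"))
ss7 = (el("CL", "d"), el("CL", ""))
ss8 = (el("CL", ""), el("CL", "ad"))
pattern = {ss6, ss7, ss8}

projected = minimal_length_projection(pattern, 3)
```

Here `projected` contains only ss6, as in Example 6. Filtering the maximal sequences is enough because every sequence of sufficient length in the pattern lies below a maximal sequence that is at least as long.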


Proposition 5. The function $\psi_{MLP}$ is a monotone, contractive and idempotent
function on the semilattice $(D, \sqcap)$.

Proof. The contractivity and idempotency are quite clear from the definition.
It remains to prove the monotonicity.

If $X \sqsubseteq Y$, where $X$ and $Y$ are sets of sequences, then for every sequence
$x \in X$ there is a sequence $y \in Y$ such that $x \leq y$ (Proposition 1). We should
show that $\psi(X) \sqsubseteq \psi(Y)$, or in other words for every sequence $x \in \psi(X)$ there is
a sequence $y \in \psi(Y)$, such that $x \leq y$. Given $x \in \psi(X)$, since $\psi(X)$ is a subset
of $X$ and $X \sqsubseteq Y$, there is a sequence $y \in Y$ such that $x \leq y$, with
$|y| \geq |x| \geq \ell$ ($\ell$ is a parameter of MLP), and thus, $y \in \psi(Y)$. □


Another important type of projections is related to a variation of the lattice
alphabet $(E, \sqcap_E)$. One possible variation of the alphabet is to ignore certain
fields in the elements. For example, if a hospitalization is described by a hospital
name and a set of procedures, then either hospital or procedures can be ignored
in similarity computation. For that, in any element the set of procedures should
be substituted by $\emptyset$, or the hospital by $*$ (“arbitrary hospital”) which is the most
general element of the taxonomy of hospitals.

Another variation of the alphabet is to require that some field(s) should
not be empty. For example, we want to find patterns with a non-empty set of
procedures, or the element $*$ of the hospital taxonomy is not allowed in elements
of a sequence. Such variations are easy to realize within our approach. For this,
when computing the similarity operation between elements of the alphabet, one
should check if the result contains empty fields and, if yes, should substitute the
result by $\bot$. This variation is useful, as it is shown in the experimental section,
but is rather difficult to define within more classical frequent sequence mining
approaches, which will be discussed later.

Example 7. An expert is interested in finding sequential patterns describing
how a patient changes hospitals, but with little interest in procedures. Thus, any
element of the alphabet lattice containing a hospital and a non-empty set of
procedures can be projected to an element with the same hospital, but with an
empty set of procedures.

Example 8. An expert is interested in finding sequential patterns containing
some information about the hospital in every hospitalization, and the
corresponding procedures, i.e. the hospital field in the patterns cannot be equal to
$*$; e.g., in Table 4 any pattern with an element of the form $[*, \cdot]$ is invalid,
while a pattern whose elements all name a hospital or a hospital class is valid. Thus, any
element of the alphabet semilattice with $*$ in the hospital field can be projected
to $\bot_E$. The corresponding figure shows the lattice of the projected pattern
structure defined by a projection of the alphabet semilattice.
Below we formally define how the alphabet projection of a sequential pattern
structure should be processed. Intuitively, every sequence in a pattern should
be substituted with another sequence, by applying the alphabet projection to
all its elements. However, the result can be an incorrect sequence, because $\bot_E$
cannot belong to a valid sequence. Thus, sequences in a pattern should be
“developed” w.r.t. $\bot_E$, as explained below.


Definition 11. Given an alphabet $(E, \sqcap_E)$, a projection of the alphabet $\psi$ and
a sequence $s = \langle s_1, \dots, s_n \rangle$ based on $E$, the projection $\psi(s)$ is the sequence
$\tilde{s} = \langle \tilde{s}_1, \dots, \tilde{s}_n \rangle$, such that $\tilde{s}_i = \psi(s_i)$.


Here, it should be noticed that $\tilde{s}$ is not necessarily a valid sequence (see the
definition of a sequence), since it can include $\bot_E$ as an element. However, in sequential pattern
structures, elements should include only valid sequences (see Section 3.3).


Definition 12. Given an alphabet $(E, \sqcap_E)$ and a projection of the alphabet $\psi_E$, the
alphabet projection for the sequential pattern structure $\psi(d)$ is the set of valid
sequences smaller than the projected sequences from $d$:


$$\psi(d) = \{ s \in \mathfrak{S} \mid (\exists t \in d)\; s \le \psi_E(t) \},$$


where $\mathfrak{S}$ is the set of all valid sequences based on $(E, \sqcap_E)$.

Example 9. {ss®} = $\{\langle [*, \{c, d\}]; [CL, \{b\}]; [CL, \{a\}] \rangle\}$ is an alphabet-projected
pattern for the pattern {ss^°} = $\{\langle [CL, \{b\}]; [CL, \{a\}] \rangle\}$, where the alphabet
lattice projection is given in Example 8.

In the case of contiguous subsequences, $\{\langle [CH, \{c, d\}] \rangle\}$ is an alphabet-projected
pattern for the pattern {ss^} = $\{\langle [CH, \{c, d\}]; [*, \{b\}]; [*, \{d\}] \rangle\}$, where the
alphabet lattice projection maps every element containing medical procedure $b$
to the element with the same hospital and the same set of procedures
excluding $b$. The projection of the sequence ss^ is $\langle [CH, \{c, d\}]; [*, \{\}]; [*, \{d\}] \rangle$, but
$[*, \{\}] = \bot_E$, and thus, in order to project the pattern {ss^}, the projected
sequence is substituted by its maximal subsequences (the candidate $\langle [*, \{d\}] \rangle$ is
itself a subsequence of $\langle [CH, \{c, d\}] \rangle$ and is therefore not maximal), i.e.


$$\psi(\{\langle [CH, \{c, d\}]; [*, \{b\}]; [*, \{d\}] \rangle\}) = \{\langle [CH, \{c, d\}] \rangle\}.$$
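
The second part of Example 9 can be replayed with a short sketch of Definition 12 for contiguous subsequences. It is an illustration under stated assumptions rather than the authors' code: elements are encoded as `(hospital, procedures)` pairs, `None` stands for $\bot_E$, the hospital taxonomy is reduced to two levels (`"*"` above every concrete hospital), and the names `project_pattern`, `project_sequence` and `drop_b` are hypothetical. Only the maximal valid sequences are returned, as a compact representation of the set in Definition 12.

```python
from typing import Callable, FrozenSet, List, Optional, Set, Tuple

Elem = Tuple[str, FrozenSet[str]]          # (hospital, procedures)
Seq = Tuple[Elem, ...]                     # a valid sequence contains no bottom
Projection = Callable[[Elem], Optional[Elem]]  # None encodes the bottom element


def elem_leq(a: Elem, b: Elem) -> bool:
    """a is at least as general as b (toy taxonomy: '*' is above every hospital)."""
    return (a[0] == "*" or a[0] == b[0]) and a[1] <= b[1]


def is_contiguous_subseq(ss: Seq, s: Seq) -> bool:
    """True if ss matches some window of s element by element."""
    n, m = len(ss), len(s)
    return any(all(elem_leq(ss[i], s[k + i]) for i in range(n))
               for k in range(m - n + 1))


def project_sequence(s: Seq, psi: Projection) -> List[Seq]:
    """Apply psi to every element and cut the result at bottom elements."""
    pieces: List[Seq] = []
    current: List[Elem] = []
    for e in s:
        p = psi(e)
        if p is None:                      # bottom cannot occur in a valid sequence
            if current:
                pieces.append(tuple(current))
            current = []
        else:
            current.append(p)
    if current:
        pieces.append(tuple(current))
    return pieces


def project_pattern(d: Set[Seq], psi: Projection) -> Set[Seq]:
    """Maximal valid sequences below the projected sequences of the pattern d."""
    cands = {piece for s in d for piece in project_sequence(s, psi)}
    return {c for c in cands
            if not any(c != o and is_contiguous_subseq(c, o) for o in cands)}


def drop_b(e: Elem) -> Optional[Elem]:
    """Projection of Example 9: remove procedure b; ('*', {}) is the bottom element."""
    hospital, procs = e[0], e[1] - {"b"}
    return None if hospital == "*" and not procs else (hospital, procs)
```

Applied to the pattern containing the single sequence `(("CH", {c, d}), ("*", {b}), ("*", {d}))`, `project_pattern` with `drop_b` keeps only `(("CH", {c, d}),)`, in agreement with the equation above.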

Proposition 6. Considering an alphabet $(E, \sqcap_E)$, a projection of the alphabet
$\psi$ and a sequential pattern structure $(G, (D, \sqcap), \delta)$, the alphabet projection (see
Definition 12) is monotone, contractive and idempotent.

Proof. This projection is idempotent, since the projection of the alphabet is
idempotent and only the projection of the alphabet can change the elements
appearing in sequences.

It is contractive because for any pattern $d \in D$ and any sequence $s \in d$,
the projection of the sequence $\tilde{s} = \psi(s)$ is a subsequence of $s$. In Definition 12
the projected sequences should be substituted by their subsequences in order to
avoid $\bot_E$, building the sets $\{\tilde{s}^i\}$. Thus, $s$ is a supersequence of every $\tilde{s}^i$, and so
the projected pattern $\tilde{d} = \psi(d)$ is subsumed by the pattern $d$.

(a) MLP projection, $\ell = 3$; (b) Projection removing $*$ in the hospital field


Figure 4: The projected concept lattices for the pattern structure given by 
Table Concept intents refer to the sequences in Tables and 


Finally, we should show monotonicity. Given two patterns $x, y \in D$ such
that $x \sqsubseteq y$, i.e. $\forall s^x \in x, \exists s^y \in y : s^x \le s^y$, consider the projected sequence of $s^x$,
$\psi(s^x)$. As $s^x \le s^y$ for some $s^y$, then for some $j_0 < \dots < j_{|s^x|}$ (see the definition of a
subsequence) $s^x_i \sqsubseteq_E s^y_{j_i}$ ($i \in \{1, 2, \dots, |s^x|\}$), then $\psi(s^x_i) \sqsubseteq_E \psi(s^y_{j_i})$ (by the monotonicity of
the alphabet projection), i.e. the projected sequence preserves the subsequence
relation. Thus, the set of allowed subsequences of $s^x$ is a subset of the set
of allowed subsequences of $s^y$. Hence, the alphabet projection of the pattern
preserves the pattern subsumption relation, $\psi(x) \sqsubseteq \psi(y)$, i.e. the
alphabet projection is monotone. □


5 Sequential pattern structure evaluation 


5.1 Implementation 

Nearly any state-of-the-art FCA algorithm can be adapted to process pattern
structures. We adapted the AddIntent algorithm [Merwe et al., 2004], as the
lattice structure is important for us to calculate stability (see an algorithm for
calculating stability in Roth et al. [2008]). To adapt the algorithm to our needs,
every set intersection operation on attributes is substituted with the semilattice
operation $\sqcap$ on the corresponding patterns, while every subset checking operation
is substituted with the semilattice order checking $\sqsubseteq$; in particular, all $(\cdot)'$ are
substituted with $(\cdot)^{\diamond}$.

The next question is how the semilattice operation $\sqcap$ and the subsumption
relation $\sqsubseteq$ can be implemented for contiguous sequences. Given two sets of
sequences $S = \{s^1, \dots, s^n\}$ and $T = \{t^1, \dots, t^m\}$, the similarity of these sets, $S \sqcap T$, is
calculated according to Section 3.3, i.e. as the maximal sequences among all common
subsequences for any pair of sequences $s^i$ and $t^j$.

To find all common subsequences of two sequences, the following observation
can be useful. If $ss = \langle ss_1; \dots; ss_l \rangle$ is a subsequence of $s = \langle s_1; \dots; s_n \rangle$ with
$j^s_i = k^s + i$, i.e. $ss_i \sqsubseteq_E s_{k^s+i}$ (by the definition of a subsequence, $k^s$ is the index
difference from which $ss$ is a contiguous subsequence of $s$), and a subsequence of
$t = \langle t_1; \dots; t_m \rangle$ with $j^t_i = k^t + i$, i.e. $ss_i \sqsubseteq_E t_{k^t+i}$, then for any index
$i \in \{1, 2, \dots, l\}$, $ss_i \sqsubseteq_E (s_{k^s+i} \sqcap_E t_{k^t+i})$.
Thus, to find all maximal common subsequences of $s$ and $t$, we first align $s$ and
$t$ in all possible ways. For each alignment of $s$ and $t$ we compute the resulting
intersection. Finally, we keep only the maximal intersected subsequences.

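For example, let us consider two possible alignments of $s^1$ and $s^2$:

Left alignment (the first element of $s^2$ is aligned with the third element of $s^1$):
    $s^1 = \langle \{a\}; \{c,d\}; \{b,a\}; \{d\} \rangle$
    $s^2 = \langle \{c,d\}; \{b,d\}; \{a,d\} \rangle$
    $ss' = \langle \emptyset; \{d\} \rangle$

Right alignment (the first element of $s^2$ is aligned with the second element of $s^1$):
    $s^1 = \langle \{a\}; \{c,d\}; \{b,a\}; \{d\} \rangle$
    $s^2 = \langle \{c,d\}; \{b,d\}; \{a,d\} \rangle$
    $ss'' = \langle \{c,d\}; \{b\}; \{d\} \rangle$

The left intersection $ss'$ is not retained, as it is not maximal ($ss' < ss''$), while
the right intersection $ss''$ is kept.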

The complexity of the alignment for two sequences $s$ and $t$ is $O(|s| \cdot |t| \cdot \gamma)$,
where $\gamma$ is the complexity of computing a common ancestor in the alphabet
lattice $(E, \sqcap_E)$.
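
The alignment procedure can be illustrated for the simplified alphabet of itemsets used in the example, where $\sqcap_E$ is set intersection. The sketch below is not the C++ implementation evaluated in the next section; the function names are hypothetical, and it assumes that an empty intersection plays the role of $\bot_E$ and therefore cuts the aligned sequence into separate candidates.

```python
from typing import FrozenSet, List, Set, Tuple

Seq = Tuple[FrozenSet[str], ...]


def is_contiguous_subseq(ss: Seq, s: Seq) -> bool:
    """True if every element of ss is included in the matching element of a window of s."""
    n, m = len(ss), len(s)
    return any(all(ss[i] <= s[k + i] for i in range(n)) for k in range(m - n + 1))


def split_at_bottom(raw: List[FrozenSet[str]]) -> List[Seq]:
    """Cut an aligned intersection at empty elements (assumed to be the bottom element)."""
    pieces: List[Seq] = []
    current: List[FrozenSet[str]] = []
    for e in raw:
        if e:
            current.append(e)
        elif current:
            pieces.append(tuple(current))
            current = []
    if current:
        pieces.append(tuple(current))
    return pieces


def similarity(s: Seq, t: Seq) -> Set[Seq]:
    """Maximal common contiguous subsequences of s and t, found by trying every alignment."""
    cands: Set[Seq] = set()
    for shift in range(-(len(t) - 1), len(s)):   # t[0] is aligned with s[shift]
        lo, hi = max(0, shift), min(len(s), shift + len(t))
        raw = [s[i] & t[i - shift] for i in range(lo, hi)]
        cands.update(split_at_bottom(raw))
    # Keep only candidates that are not subsequences of another candidate.
    return {c for c in cands
            if not any(c != o and is_contiguous_subseq(c, o) for o in cands)}
```

With the two sequences of the example, the alignment at `shift = 1` produces the right intersection $\langle \{c,d\}; \{b\}; \{d\} \rangle$, which is retained, whereas the candidate $\langle \{d\} \rangle$ obtained from the left alignment is discarded as non-maximal. The final filtering step is quadratic in the number of candidates and is not covered by the alignment complexity stated above.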


5.2 Experiments and discussion 

The experiments were carried out on a MacBook Pro with a 2.5 GHz Intel Core i5
and 8 GB of RAM, running OS X 10.6.8. The algorithms are not parallelized
and are coded in C++.

Our use-case dataset comes from a French healthcare system called PMSI¹
[Fetter et al., 1980]. Each element of a sequence has a “complex” nature. The
dataset contains 500 patients suffering from lung cancer, who live in the Lorraine
region (Eastern France). Every patient is described as a sequence of
hospitalizations without any time-stamp. A hospitalization is a tuple with three elements:
(i) healthcare institution (e.g. university hospital of Nancy ($CHU_{Nancy}$)), (ii)
reason for the hospitalization (e.g. a cancer disease), and (iii) set of medical
procedures that the patient undergoes. An example of a medical trajectory is
given below:


$$\langle [CHU_{Nancy}, Cancer, \{mp_1, mp_2\}]; [CH_{Paris}, Chemo, \{\}]; [CH_{Paris}, Chemo, \{\}] \rangle.$$


This sequence represents a patient trajectory with three hospitalizations. It
expresses that the patient was first admitted to the university hospital of Nancy
($CHU_{Nancy}$) for a cancer problem as a reason, and underwent procedures $mp_1$
and $mp_2$. Then he had two consecutive hospitalizations in the general hospital of
Paris ($CH_{Paris}$) for chemotherapy with no additional procedure. Substituting
identical consecutive hospitalizations by the number of repetitions, we obtain a
shorter and more understandable trajectory. For example, the above trajectory
is transformed into two hospitalizations where the first hospitalization repeats
once and the second twice:

$$\langle [CHU_{Nancy}, Cancer, \{mp_1, mp_2\}] \times [1]; [CH_{Paris}, Chemo, \{\}] \times [2] \rangle.$$
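
This compression step amounts to a run-length encoding of the trajectory. A minimal sketch is given below; the tuple encoding of a hospitalization and the function name `compress` are assumptions made for illustration only.

```python
from itertools import groupby
from typing import FrozenSet, List, Tuple

# Hypothetical encoding: (institution, reason, set of procedures)
Hospitalization = Tuple[str, str, FrozenSet[str]]


def compress(trajectory: List[Hospitalization]) -> List[Tuple[Hospitalization, int]]:
    """Replace each run of identical consecutive hospitalizations by (hospitalization, count)."""
    return [(h, len(list(run))) for h, run in groupby(trajectory)]


trajectory = [
    ("CHU_Nancy", "Cancer", frozenset({"mp1", "mp2"})),
    ("CH_Paris", "Chemo", frozenset()),
    ("CH_Paris", "Chemo", frozenset()),
]
compressed = compress(trajectory)
```

Here `compressed` contains two entries, the first with count 1 and the second with count 2, mirroring the transformed trajectory above.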

¹ Programme de Médicalisation des Systèmes d’Information.
Figure 5: A geographical taxonomy of the healthcare institution


Figure 6: The length distribution of sequences in the dataset (x-axis: trajectory length)


Diagnoses are coded according to the 10th revision of the International
Classification of Diseases (ICD-10). Based on this coding, diagnoses can be described at 5
levels of granularity: root, chapter, block, 3-character, and 4-character (terminal
nodes). This taxonomy has 1544 nodes. The healthcare institution is
associated with a geographical taxonomy of 4 levels, where the first level refers to
the root (France) and the second, third and fourth levels correspond
to administrative region, administrative department and hospital, respectively.
Figure 5 presents the University Hospital of Nancy (code: 540002078) as a hospital
in Meurthe et Moselle, which is a department in Lorraine, a region of France. This
taxonomy has 304 nodes. The medical procedures are coded according to the
French nomenclature “Classification Commune des Actes Médicaux (CCAM)”.
The distribution of sequence lengths is shown in Figure 6.

With 500 patient trajectories, the computation of the whole lattice is
infeasible. We are not interested in all possible frequent trajectories, but rather
in trajectories which answer medical analysis questions. An expert may know
the minimal size of the trajectories that he is interested in, i.e. setting the MLP
projection. We use the MLP projection of length 2 and 3 and take into account
that most of the patients have at least 2 hospitalizations in the trajectory (see
Figure 6).


Figure 7: Computational time for different projections. (a) MLP projection, $\ell = 2$; (b) MLP projection, $\ell = 3$. x-axis: database size.

Figure 7 shows computational times for different projections as a function of
dataset size. Figure 7a shows different alphabet projections for the MLP projection
with $\ell = 2$, while Figure 7b shows them for MLP with $\ell = 3$. Every alphabet projection
is given by the names of the fields that are considered within the projection: G
corresponds to hospital geo-location, R is the reason for a hospitalization, P is
medical procedures and I is the repetition interval, i.e. the number of consecutive
hospitalizations with the same reason. We can see from these figures that MLP
allows one to save computational resources as $\ell$ increases. The
difference in computational time between the $\ell = 2$ and $\ell = 3$ projections is significant,
especially for time-consuming cases. An even bigger variation can be noticed for
the alphabet projections. For example, computation of the RPI projection takes
100 times more resources than any of GRP, RP, GR, GRP.

The same dependency can be seen in Figure 8, where the number of concepts
for every projection is shown. Consequently, it is important for an expert to
provide a strict projection that allows him to answer his questions in order to
save computational time and memory.

Table 5 shows some interesting concept intents with the corresponding
support and ranking w.r.t. concept stability. For example, the concept #1 is
obtained under the projection GR (i.e., we consider only hospital and reason), with
the intent $\langle [Lorraine, C341\ Lung\ Cancer] \rangle$, where C341 Lung Cancer is a
special kind of lung cancer (malignant neoplasm of upper lobe, bronchus or lung).
This concept is the most stable concept in the lattice for the given projection,
and the size of the concept extent is 287 patients.
Figure 8: Lattice size for different projections. (a) MLP projection, $\ell = 2$; (b) MLP projection, $\ell = 3$


Table 5: Interesting concepts, for different projections.

#  | Projection | Intent                                                                                                 | Stab. Rank | Support
1  | GR         | $\langle [Lorraine, C341\ Lung\ Cancer] \rangle$                                                       | 1          | 287
2  | GR2        | $\langle [Lorraine, Respiratory\ Disease]; [CHU_{Nancy}, Lung\ Cancer] \rangle$                        | 26         | 22
3  | GR3        | $\langle [Lorraine, Chemotherapy] \times 4 \rangle$                                                    | 1          | 176
4  | RPI3       | $\langle [Preparation\ for\ Chemotherapy, \{Lung\ Radiography\}]; [Chemotherapy] \times [3,4] \rangle$ | 5          | 36

One of the questions that the analyst would like to address here is “Where
do patients stay (i.e. hospital location) during their treatment, and for which
reason?” To answer this question, we consider only the healthcare institution
and reason fields, requiring both to “hold” some information, and we use the
MLP projection of length 2 and 3 (i.e. projections GR2 and GR3). Nearly all
frequent trajectories show that patients are usually treated in the same region.
However, pattern #2, obtained under the GR2 projection, shows that “22 patients
were first admitted to some healthcare institution in the Lorraine region for a
problem related to the respiratory system and then they were treated for a lung cancer
in the University Hospital of Nancy.”

Another interesting question is “What are the sequential relations between
hospitalization reasons and the corresponding procedures?”. To answer this
question, we are not interested in healthcare institutions. Thus, any alphabet
element is projected by substituting the healthcare institution field with $*$. As the
hospitalization reason is important in each hospitalization, any alphabet element
without the hospitalization reason is of no use and is projected to the bottom
element $\bot_E$ of the alphabet. Such projections are called RPI2 or RPI3, meaning
that we consider the fields “Reason” and “Procedures”, while the reason should
not be empty and the MLP parameter is 2 or 3. Pattern #4 trivially states
that “36 patients with lung cancer are hospitalized once for the preparation
of chemotherapy and during this hospitalization they undergo lung radiography.
Afterwards, they are hospitalized between 3 and 4 times for chemotherapy.”

Variability is high in healthcare processes and affects many aspects of
healthcare trajectories: patients, medical habits and protocols, healthcare
organisation, availability of treatments and settings. Mining sequential pattern
structures is an interesting approach for finding regularities across one or several
dimensions of medical trajectories in a population of patients. It is flexible
enough to help healthcare managers to answer specific questions regarding the
natural organisation of care processes and to further compare them with
expected or desirable processes. The use of taxonomies also plays a key role in
finding the right level of description of sequential patterns and reducing the
interpretation overhead.


6 Related work 


Agrawal and Srikant [1995] introduced the problem of mining sequential
patterns over large sequential databases. Formally, given a set of sequences, where
each sequence is a list of transactions ordered by time and each transaction is a
set of items, the problem amounts to finding all frequent subsequences, i.e. those that
appear at least as often as a user-specified minimum support threshold
(minsup). Following the work of Agrawal and Srikant, many studies have
contributed to the efficient mining of sequential patterns [Mooney and Roddick,
2013]. Most of them are based on the antimonotonicity property (used in
Apriori), which states that any super pattern of a non-frequent pattern cannot be
frequent. The main algorithms are PrefixSpan [Pei et al., 2001b], SPADE [Zaki,
2001], SPAM [Ayres et al., 2002], PSP [Masseglia et al., 1998], DISC [Chiu
et al., 2004], PAID [Yang et al., 2006] and FAST [Salvemini et al., 2011]. All
these algorithms aim at discovering sequential patterns from a set of sequences
of itemsets, such as customers who frequently buy DVDs of episodes I, II and III
of Star Wars, then buy within 6 months episodes IV, V, VI of the same famous
epic space opera.

Many studies about sequential pattern discovery focus on single-dimensional
sequences. However, in many situations, the database is multidimensional in
the sense that items can be of different nature. For example, a consumer
database can hold information such as article price, gender of the customer,
location of the store and so on. Pinto et al. [2001] proposed the first work for
mining multidimensional sequential patterns. In this work, a multidimensional
sequential database is defined as a schema $(ID, D_1, \dots, D_m, S)$, where $ID$ is a
unique customer identifier, $D_1, \dots, D_m$ are dimensions describing the data and $S$
is the sequence of itemsets. A multidimensional sequence is defined as a vector
$\langle \{d_1, d_2, \dots, d_m\}, S_1, S_2, \dots, S_l \rangle$, where $d_i \in D_i$ for $i \le m$ and $S_1, S_2, \dots, S_l$ are
the itemsets of the sequence $S$. For instance, $\langle \{Metz, Male\}, \{mp_1, mp_2\}, \{mp_3\} \rangle$
describes a male patient who underwent procedures $mp_1$ and $mp_2$ in Metz and
then underwent $mp_3$ also in Metz. Here, dimensions, such as the location of the
treatment, remain constant over time. This means that it is not possible to have
a pattern indicating that when the patient underwent procedures $mp_1$ and $mp_2$
in Metz then he underwent $mp_3$ in Nancy. Among other proposals, Yu and Chen
[2005] proposed two methods, AprioriMD and PrefixMDSpan, for mining
multidimensional sequential patterns in the web domain. This study considers pages,
sessions and days as dimensions. Actually, these three different dimensions can
be projected into a single dimension corresponding to web pages, gathering the web
pages visited during the same session and ordering sessions w.r.t. the day.

In real world applications, each dimension can be represented at different
levels of granularity, by using a poset. For example, apples in a market basket
analysis can be described as fruits, fresh food or food. The interest lies in
the capacity of extracting more or less general/specific multidimensional
sequential patterns and overcoming problems of excessive granularity and low support.
Srikant and Agrawal [1996] proposed GSP, which uses posets for extracting
sequential patterns. The basic approach is based on replacing every item with all
the ancestors in the poset and then the frequent sequences are generated. This
approach is not scalable in a multidimensional context because the size of the
database becomes the product of the maximum height of the posets and the number of
dimensions.
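
The ancestor-replacement step described above can be pictured with the following sketch. It is not the GSP algorithm itself, only the preprocessing idea; it assumes a tree-shaped taxonomy given as a child-to-parent map (a general poset may have several parents per item), and the names `ancestors` and `extend_with_ancestors` are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List


def ancestors(item: str, parent: Dict[str, str]) -> List[str]:
    """All ancestors of an item, from its direct parent up to the root."""
    found = []
    while item in parent:
        item = parent[item]
        found.append(item)
    return found


def extend_with_ancestors(sequence: Iterable[Iterable[str]],
                          parent: Dict[str, str]) -> List[FrozenSet[str]]:
    """Add to every itemset of the sequence the ancestors of each of its items."""
    return [frozenset(itemset) | {a for i in itemset for a in ancestors(i, parent)}
            for itemset in sequence]


# Toy taxonomy following the apple example: apple < fruit < fresh food < food.
parent = {"apple": "fruit", "fruit": "fresh food", "fresh food": "food"}
extended = extend_with_ancestors([{"apple"}, {"food"}], parent)
```

In this toy case the first itemset grows from one item to four, which illustrates why the extended database becomes large when several deep taxonomies are combined.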


Plantevit et al. [2010] defined a multidimensional sequence as an ordered list
of multidimensional items, where a multidimensional item is a tuple $(d_1, \dots, d_m)$
and $d_i$ is an item associated with the $i$-th dimension. They proposed M3SP, an
approach taking both aspects into account where each dimension is represented
at different levels of granularity, by using a poset. M3SP is able to search for
sequential patterns with the most appropriate level of granularity. Their approach
is based on the extraction of the most specific frequent multidimensional items,
which are then used as an alphabet to rephrase the original database. Then, M3SP
uses a standard sequential pattern mining algorithm to extract
multidimensional sequential patterns. However, M3SP is not adapted to mine sequential
databases where sequences are defined over a combination of sets of items and
items lying in a poset. Thus, it is not possible to have a pattern indicating that
when the patient went to uhp for a problem of cancer ca, where he underwent
procedures $mp_1$ and $mp_2$, then he went to ghi for the same medical problem ca,
where he underwent $mp_3$ (i.e., $\langle (uhp, ca, \{mp_1, mp_2\}), (ghi, ca, \{mp_3\}) \rangle$). Our
approach allows us to process such patterns and, in addition, the elements
of sequences are even more general. For example, besides multidimensional and
multilevel sequences, sequences of graphs fall under our definition. Moreover,
frequent subsequence mining gives rise to a lot of subsequences which can
hardly be analyzed by an expert. Since our approach is based on Formal Concept
Analysis (FCA) [Ganter and Wille, 1999], we can use efficient relevance indexes
defined in FCA.

This paper is not the first attempt to use FCA for the analysis of sequential
data. Ferré [2007] processes sequential datasets based on a “simple” alphabet
without involving any partial order. In Casas-Garriga [2005] only sequences
of itemsets are considered. All closed subsequences are first mined and then
regrouped by a specialized algorithm in order to obtain a lattice similar to
the FCA lattice. This approach was not verified experimentally. Moreover,
compared with both approaches, i.e. Ferré [2007] and Casas-Garriga [2005],
our approach suggests a more general definition of sequences and, thanks to
pattern structures, there is no ‘pre-mining’ step to find frequent (or maximal)
subsequences. This allows us to apply different “projections” specializing the
request of an expert and simplifying the computations. In addition, in our
approach nearly all state-of-the-art FCA algorithms can be used in order to
efficiently process a dataset.

There are a number of approaches that help to analyze medical treatment
data. However, a direct comparison of them is hardly possible, because
every approach is designed for its own problem. For example, Tsumoto et al.
[2014] analyze data of one hospital and provide a different view on the processes
within the hospital w.r.t. our approach. Finally and naturally, the most similar
approach to our work can be found in Egho et al. [2014a,b], as some authors of
the present paper are involved in this alternative work. In Egho et al. [2014a,b]
the authors mine frequent sequences of a dataset similar to the sequences studied
here. However, they approach the complexity of the analysis of such data in a
different way. They use a support threshold in order to specify the outcome of
the algorithm and do not provide any order in which one can analyze the result.
In our case we rely on projections, which usually make it simpler to incorporate expert
knowledge than a support threshold does, and we give an order (w.r.t. stability of a
concept) which can be used to simplify the analysis of the treatment data.
7 Conclusion


In this paper, we have presented a novel approach for analyzing sequential data 
within the framework of pattern structures, an extension of Formal Concept 
Analysis dealing with complex data. It is based on the formalism of sequential 
pattern structures and projections. Our work complements the general orienta¬ 
tions towards statistically significant patterns by presenting strong formal results 
on the notion of interestingness from a concept lattice viewpoint. The frame¬ 
work of pattern structures is very flexible and shows some important properties, 
for example in allowing to reuse state-of-the-art and efficient FCA algorithms. 
Using pattern structures leads to the construction of a pattern concept lattice, 
which does not require the setting of a support threshold, as usually needed 
in classical sequential pattern mining. Moreover, the use of projections gives a 
lot of flexibility especially for mining and interpreting special kinds of patterns 
(patterns can be proposed at several levels of complexity w.r.t. extraction and 
interpretation). 

Our framework was tested on a real-world dataset with patient hospitaliza¬ 
tion trajectories. Interesting patterns answering questions of an expert are ex¬ 
tracted and interpreted, showing the feasibility and usefulness of the approach, 
and the importance of the stability as a pattern-selection procedure. In partic¬ 
ular, projections play an important role here: mainly, they provide means to 
select patterns of a special interest and they help to save computational time 
(which could be otherwise very large). 

For future work, we are planning to investigate projections and their
potential w.r.t. the types of patterns more deeply. It can be interesting to introduce
and evaluate the stability measure directly on sequences. Another research
direction is mining of association rules or building a Horn approximation [Balcázar
and Casas-Garriga, 2005] from the stable part of the pattern lattice or stable
sequences. Finally, as discussed above, a precise study combining frequent
subsequence mining and FCA-based approaches should be carried out.